Search CORE

1,736,729 research outputs found

Classifying document types to enhance search and recommendations in digital libraries

Author: F Sebastiani
L Maaten van der
Y Aphinyanaphongs
Publication date: 13/07/2017

In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories that would enable us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios, ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve a 0.96 F1-score using the random forest and AdaBoost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE [1] digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience.

Comment: 12 pages, 21st International Conference on Theory and Practice of Digital Libraries (TPDL), 2017, Thessaloniki, Greece

arXiv.org e-Print Archive

Crossref

Unsupervised learning of document image types

Author: Curtis Dean Patrick
Publication venue: Digital Scholarship@UNLV
Publication date: 01/01/2007

In a system where medical paper document images have been converted to a digital format by a scanning operation, understanding the document types that exist in this system could provide for vital data indexing and retrieval. In a system where millions of document images have been scanned, it is infeasible to expect a supervised algorithm or a tedious (human-based) effort to discover the document types. The most sensible and practical way to do that is an unsupervised algorithm. Many clustering techniques have been developed for unsupervised classification. Many rely on all data being presented at once, on the number of clusters being known, or both. Presented in this thesis is a two-threshold clustering scheme that relies on a hierarchical decomposition of the features. On a subset of document images, it discovers document types at an acceptable level and confidently classifies unknown document images.

University of Nevada, Las Vegas Repository

Folksonomies vs. Bag-of-Words: The Evaluation & Comparison of Different Types of Document Representations

Author: Gruzd Anatoliy
Publication date: 01/01/2006

Published or submitted for publication; is peer reviewed

Illinois Digital Environment for Access to Learning and Scholarship Repository

The University of Arizona

IVOA Recommendation: VOResource: an XML Encoding Schema for Resource Metadata Version 1.03

Author: Benson Kevin
Graham Matthew
Greene Gretchen
IVOA Registry Working Group
Harrison Paul
Lemson Gerard
Linde Tony
Plante Raymond
Rixon Guy
Stebe Aurelien
Publication venue: 'Smithsonian Institution'
Publication date: 03/10/2011

This document describes an XML encoding standard for IVOA Resource Metadata, referred to as VOResource. This schema is primarily intended to support interoperable registries used for discovering resources; however, any application that needs to describe resources may use this schema. In this document, we define the types and elements that make up the schema as representations of metadata terms defined in the IVOA standard, Resource Metadata for the Virtual Observatory [Hanisch et al. 2004]. We also describe the general model for the schema and explain how it may be extended to add new metadata terms and describe more specific types of resources.

arXiv.org e-Print Archive

Crossref

What makes papers visible on social media? An analysis of various document characteristics

Author: Costas Rodrigo
Haustein Stefanie
Larivière Vincent
Zahedi Zohreh
Publication date: 01/01/2016

In this study we have investigated the relationship between different document characteristics and Mendeley readership counts, tweets, Facebook posts, and mentions in blogs and mainstream media for 1.3 million papers published in journals covered by the Web of Science (WoS). It aims to demonstrate how factors affecting various social media-based indicators differ from those influencing citations, and which document types are more popular across different platforms. Our results highlight the heterogeneous nature of altmetrics, which encompasses different types of uses and user groups engaging with research on social media.

Comment: Presented at the 21st International Conference on Science & Technology Indicators (STI), 13-16 September 2016, Valencia, Spain

arXiv.org e-Print Archive

Leiden University Scholarly Publications

Rewrite based Verification of XML Updates

Author: Jacquemard Florent
Rusinowitch Michael
Publication date: 01/01/2009

We consider problems of access control for updates of XML documents. In the context of XML programming, types can be viewed as hedge automata, and static type checking amounts to verifying that a program always converts valid source documents into valid output documents. Given a set of update operations, we are particularly interested in checking safety properties such as preservation of document types along any sequence of updates. We are also interested in the related policy consistency problem, that is, detecting whether a sequence of authorized operations can simulate a forbidden one. We reduce these questions to type checking problems, solved by computing variants of hedge automata characterizing the set of ancestors and descendants of the initial document type for the closure of parameterized rewrite rules.

arXiv.org e-Print Archive

CiteSeerX

HAL - Université de Franche-Comté

Crossref

INRIA a CCSD electronic archive server

A granular approach to web search result presentation

Author: Jose J.M.
Ruthven I.
White R.W.
Publication date: 01/01/2003

In this paper we propose and evaluate interfaces for presenting the results of web searches. Sentences, taken from the top retrieved documents, are used as fine-grained representations of document content and, when combined in a ranked list, provide a query-specific overview of the set of retrieved documents. Current search engine interfaces assume users examine such results document-by-document. In contrast, our approach groups, ranks and presents the contents of the top-ranked document set. We evaluate our hypotheses that the use of such an approach can lead to more effective web searching and to increased user satisfaction. Our evaluation, with real users and different types of information-seeking scenarios, showed, with statistical significance, that these hypotheses hold.

University of Strathclyde Institutional Repository